For the final exam/project we will develop classification models using several approaches and compare their performance on a new dataset that is a subset of one of the datasets used in machine learning common task framework (CTF) competitions. A copy of it split into training (final-data-train.csv, with outcome yn available) and test (final-data-test.csv, stripped of the outcome, for prediction purposes only) datasets is available on our course website in Canvas as a zip-archive of all associated files.
Please notice, that at the end of this final exam/project you will be asked, in addition to the Rmarkdown and HTML files, to also make predictions for the observations in the test (not training!) dataset and upload them into Canvas as well. The expected format for the file with predictions for test dataset is two columns of comma-separated values, one row per observation in test dataset, first column – the observation identifier (column id in test dataset) and the second column – your best model predictions for each observation in the test dataset as yes/no indicator. To illustrate expected format the zip archive contains also a couple of examples of test predictions in this format for your reference as well (predictions-*.csv files in predictions-examples sub-folder in zip-archive).
One more time, to iterate and emphasize, please notice that this time your submission must consist of the following three (not just two, Rmd+html, as usual) items:
The teaching team invites you to load your predictions (just predictions, just for the test dataset according to the file format shown in the sample files in the zip-archive) into Canvas repeatedly as you work on your models and improve them over the course of this week. At least daily (or more frequently as we see fit) we will download predictions loaded by everyone by that time and compile a leaderboard in html format of all of them sorted by their accuracy as compared to the true values of the outcome for the test dataset (along with their sensitivity, specificity, etc.). This list will be made available on our course website in Canvas for everyone in this class in order to see how the performance of their models compares across the rest of the models built by other students in the class. The first version of the leaderboard posted on the course website at the time when final exam is made available starts with predictions made by those few example files provided in the zip-archive (coin flip, majority vote, etc. – what do you think Charlie Brown is using to make the predictions?). Those should be pretty easy to improve upon.
It is 100% up to you whether you want to upload your model predictions over the course of this week, how frequently you want to do it and what you want its results to be called in the leaderboard posted for everyone in the class to see. We will use the name of the 2nd column in the file with predictions (the one containing yes/no values, not the numerical ids of the observations) as the model name listed in the leaderboard. If you prefer not to use your name, choose something else instead, sufficiently unique so that it is easier for you to spot your result among all others. Once again, please check out sample files of dull (majority vote, coin flip, etc.) predictions we have made available and consider how they show up in the leaderboard html file already posted on Canvas website. Once you are done with final you are expected to load predictions from your best model into Canvas – that is part of your points total as explained below.
Lastly, the size of this dataset can make some of the modeling techniques run slower than what we were typically encountering in this class. You may find it helpful to do some of the exploration and model tuning on multiple random samples of smaller size as you decide on useful ranges of parameters/modeling choices, and then only perform a final run of fully debugged and working code on the full dataset. Please see also the afterword below on the computational demands of this problem set.
Before we get started, I’m going to define a few useful functions for later.
calcAcc <- function (cm) {
sum(diag(cm)) / sum(cm)
}
calcSpecificity <- function (cm) {
cm[1, 1] / sum(cm[, 1])
}
calcSensitivity <- function (cm) {
if (ncol(cm) == 2) {
cm[2, 2] / sum(cm[, 2])
}
else {
0
}
}
Download and read training and test data into R and prepare graphical and numerical summaries of it: e.g. histograms of continuous attributes, contingency tables of categorical variables, scatterplots of continuous attributes with some of the categorical variables indicated by color/symbol shape, etc. Whatever you find helpful to think about properties of the data you are about to start using for fitting classification models.
As it is often the case for such contests, the atributes in the dataset are blinded in the sense that no information is available about what those are or what their values mean. The only information available is that the attribute
ynis the outcome to be modeled and the attributeidis the unique numerical identifier for each observation. Some of the remaining attributes are clearly categorical (those that are character valued) and some rather obviously continuous (those with numerical values with large number of unique values). For several of them it is less clear whether it is best to treat them as continuous or categorical – e.g. their values are numerical but there are relatively few unique values with many observations taking the same value, so that they arguably could be treated as continuous or categorical. Please idenify them, reflect on how you prefer to handle them and describe this in your own words.
Perform principal components analysis of this data (do you need to scale it prior to that? how would you represent multilevel categorical attributes to be used as inputs for PCA?) and plot observations in the space of the first few principal components indicating levels of some of the categorical attributes of your choosing by the color/shape of the symbol. Perform univariate assessment of associations between outcome we will be modeling and each of the attributes (e.g. t-test or logistic regression for continuous attributes, contingency tables/Fisher exact test/\(\chi^2\) test for categorical attributes). Summarize your observations from these assessments: does it appear that there are predictors associated with the outcome
ynunivariately? Which predictors seem to be more/less relevant?
Okay, let’s load in the dataset and take a look at what the data looks like.
rawData <- read_csv("final-data-train.csv") %>% mutate_if(is.character, as.factor)
## Parsed with column specification:
## cols(
## yn = col_character(),
## id = col_integer(),
## cg = col_double(),
## eg = col_character(),
## re = col_character(),
## ms = col_character(),
## fw = col_double(),
## cl = col_double(),
## hw = col_double(),
## ra = col_character(),
## nc = col_double(),
## se = col_character(),
## ag = col_double(),
## wc = col_character(),
## ne = col_character(),
## oc = col_character(),
## dr = col_character()
## )
head(rawData)
## # A tibble: 6 x 17
## yn id cg eg re ms fw cl hw ra nc se
## <fct> <int> <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 no 177729 1.72 e2 pzv lz 3.33 1.41 2.45 ojk 0.3 vrm
## 2 no 513216 1.83 e2 wto ls 3.27 1.42 3 kvy 1.2 vrm
## 3 no 297005 1.61 e2 wib kh 0.537 1.35 3 gmp 0 vrm
## 4 no 467346 1.62 e2 wib kh 1.92 1.36 2.83 kvy 0.5 vrm
## 5 no 229364 1.59 e2 wto im 3.27 1.36 2.65 kvy 0 vrm
## 6 yes 378290 1.93 e2 wto ls 1.10 1.43 3.32 kvy 0.5 vrm
## # ... with 5 more variables: ag <dbl>, wc <fct>, ne <fct>, oc <fct>,
## # dr <fct>
summary(rawData)
## yn id cg eg re
## no :22165 Min. : 41 Min. :1.313 e2 :26526 ldi: 1286
## yes: 7257 1st Qu.:149610 1st Qu.:1.656 e8 : 866 pzv: 8983
## Median :298736 Median :1.745 e6 : 772 wib: 3443
## Mean :298754 Mean :1.785 e5 : 335 wto:15710
## 3rd Qu.:446830 3rd Qu.:1.867 e3 : 325
## Max. :595175 Max. :4.720 e1 : 318
## (Other): 280
## ms fw cl hw
## lz :9356 Min. : 0.000284 Min. :0.000 Min. :0.000
## kh :7255 1st Qu.: 0.646817 1st Qu.:1.338 1st Qu.:2.830
## im :4142 Median : 1.385567 Median :1.369 Median :3.160
## ls :2891 Mean : 1.770069 Mean :1.286 Mean :3.004
## ow :2491 3rd Qu.: 2.444534 3rd Qu.:1.394 3rd Qu.:3.460
## hf :1593 Max. :26.224451 Max. :1.632 Max. :3.740
## (Other):1694
## ra nc se ag wc
## erm: 33 Min. :0.0000 jtl: 3038 Min. :0.1414 cz: 114
## gmp: 6074 1st Qu.:0.2000 vrm:26384 1st Qu.:0.3162 gf: 1279
## kvy:21454 Median :0.3000 Median :0.3742 li: 794
## ojk: 578 Mean :0.4152 Mean :0.3746 pv: 9912
## tdg: 1283 3rd Qu.:0.6000 3rd Qu.:0.4000 qx: 37
## Max. :1.8000 Max. :0.8185 xv:17286
##
## ne oc dr
## jle: 6416 lr :9637 oi:23467
## kcr: 8487 al :9568 qe: 5955
## rql: 1929 ey :3331
## ydq:12590 jz :2406
## jl :1376
## rk : 844
## (Other):2260
rawData %>% select_if(is.numeric) %>% select(-id) %>% pairs(col = rawData$yn)
Columns ag and cg appear to have pretty good correlation to the target. Something interesting is happening in coumn cl, where many rows have value zero and non-zero quantities seem to be between 1 and about 1.7. It seems like column hw would benefit from log transformation. I don’t think it would benefit from conversion to categorical though, there seems to be some amount of correlation between the value and the target.
for (catRow in select_if(rawData, is.factor)) {
print(table(rawData$yn, catRow))
}
## catRow
## no yes
## no 22165 0
## yes 0 7257
## catRow
## e1 e2 e3 e4 e5 e6 e7 e8
## no 42 21330 116 30 166 240 26 215
## yes 276 5196 209 187 169 532 37 651
## catRow
## ldi pzv wib wto
## no 833 6345 3043 11944
## yes 453 2638 400 3766
## catRow
## ay hf im kh ls lz ow yd
## no 562 1047 3128 5751 1977 7438 1727 535
## yes 312 546 1014 1504 914 1918 764 285
## catRow
## erm gmp kvy ojk tdg
## no 10 4527 16272 448 908
## yes 23 1547 5182 130 375
## catRow
## jtl vrm
## no 1038 21127
## yes 2000 5257
## catRow
## cz gf li pv qx xv
## no 51 502 638 7927 11 13036
## yes 63 777 156 1985 26 4250
## catRow
## jle kcr rql ydq
## no 4803 6354 1445 9563
## yes 1613 2133 484 3027
## catRow
## al ey jl jz lr me pw qt rk ry sb wg zo
## no 8337 2836 1024 1824 6337 45 489 157 611 29 216 260 0
## yes 1231 495 352 582 3300 39 231 104 233 25 78 565 22
## catRow
## oi qe
## no 17633 4532
## yes 5834 1423
First thing worth noting in the categorical features is that target no is about three times more common than a target yes. Feature se seems to have pretty good correlation and is only two levels. It seems also that most target no samples have eg of e2, which may be helpful. There may be some correlations with the other factors but it’s difficult to tell with both the class imbalance and the higher number of factors.
rawData %>% mutate(eg2 = eg == "e2", yn = yn == "yes") %>% select(yn, eg2) %>% cor()
## yn eg2
## yn 1.0000000 -0.3564462
## eg2 -0.3564462 1.0000000
It definitely appears that e2 can provide some value in predicting yn.
In order to fix column cl, I’ll use correlated features to fit a linear regression on the non-zero-valued samples of cl and predict values to replace the zero-valued samples.
rawData %>% select_if(is.numeric) %>% select(-id) %>% filter(cl != 0) %>% cor() %>% .[,"cl"]
## cg fw cl hw nc ag
## 0.43125803 -0.01927038 1.00000000 0.00259574 0.07142344 0.59097593
Looks like cg and ag are our best bet.
clLRFit <- glm(cl ~ cg + ag, data = filter(rawData, cl != 0))
newcl <- predict(clLRFit, newdata = filter(rawData, cl == 0))
summary(clLRFit)
##
## Call:
## glm(formula = cl ~ cg + ag, data = filter(rawData, cl != 0))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.27626 -0.02668 0.01131 0.02147 0.17417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.186429 0.001828 649.06 <2e-16 ***
## cg 0.015981 0.001329 12.03 <2e-16 ***
## ag 0.421856 0.005008 84.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.001220637)
##
## Null deviance: 51.971 on 27564 degrees of freedom
## Residual deviance: 33.643 on 27562 degrees of freedom
## AIC: -106685
##
## Number of Fisher Scoring iterations: 2
plot(clLRFit)
Not an incredible fit, but it will do better than all zeroes.
Now, I’ll create logical features which flag when eg = "e2" and cl == 0, log-transform hw, and replace zero-valued cl samples with the above predicted values. I also create a tbl where the factor columns have been replaced with numeric dummy columns.
trainData <- rawData %>%
mutate(
e2 = eg == "e2",
clz = cl == 0,
hw = log(hw + 1e-6)
)
trainData[trainData$cl == 0, "cl"] <- newcl
trainDataDummies <- as.tibble(dummy.data.frame(as.data.frame(select(trainData, -yn)))) %>%
bind_cols(select(trainData, yn))
With that done, let’s take a look at what a principal components analysis will look like on this data.
pcaFit <- prcomp(select(trainDataDummies, -yn), scale. = TRUE)
plot(pcaFit)
The explained variance by each dimension does not drop off all that quickly.
biplot(pcaFit, pc.biplot = TRUE)
In the first two dimensions, you see large vector components from wcpv, e2, wcxv, cl, and ag.
factorColNames <- colnames(select_if(trainData, is.factor))
for (factorColName in factorColNames) {
pairs(pcaFit$x[, 1:3], col = trainData[[factorColName]])
title(factorColName)
}
The data takes an interesting shape. There is definitely one large mass, and one much smaller mass of points. The small mass is all classified the same yn value. The larger mass also seems to be split into two attached masses.
Let’s try fitting some single variable logistic regressions on this data to get a sense for which numeric features have predictive power on their own.
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLRSV <- tibble()
numericColNames <- colnames(select(select_if(trainData, is.numeric), -id))
for (fold in 1:k) {
for (numericColName in numericColNames) {
singleVarData <- bind_cols(
var = trainData[, numericColName],
yn = trainData[, "yn"]
) %>% mutate(yn = as.numeric(yn) - 1)
trainDataFit <- singleVarData[kfInd != fold, ]
trainDataVal <- singleVarData[kfInd == fold, ]
lrFit <- glm(
yn ~ .,
data = trainDataFit)
lrPredict <- predict(lrFit, newdata = trainDataVal, type = "response")
lrCM <- table(trainDataVal$yn, round(lrPredict))
xvalResultsLRSV <- rbind(xvalResultsLRSV, tibble(
fold = fold,
acc = calcAcc(lrCM),
sens = calcSensitivity(lrCM),
spec = calcSpecificity(lrCM),
model = paste("lr", numericColName, sep = "-")
))
}
}
xvalResultsLRSV %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
Some, such as ag and cl, are actually fairly decent on their own. Several have very low sensitivity. Column nc is sort of middling. Let’s see if a more complicated analysis can provide a better result.
Develop logistic regression model of the outcome
ynas a function of multiple predictors in the model. Which variables are significantly associated with the outcome? Test model performance on multiple splits of data into training and test subsets and summarize it in terms of accuracy/error/sensitivity/specificity.
trainData %>%
select_if(is.numeric) %>% select(-id) %>% cbind(yn = as.numeric(trainData$yn)) %>%
cor() %>% .["yn", ] %>% sort(decreasing = TRUE)
## yn cg ag nc cl
## 1.000000000 0.484006546 0.343500911 0.256036764 0.160414523
## hw fw
## 0.089530953 -0.002996787
Decent correlation with yn and features cg, ag, and nc.
trainData %>%
select_if(is.numeric) %>% select(-id) %>% cbind(yn = as.numeric(trainData$yn)) %>%
cor()
## cg fw cl hw nc
## cg 1.00000000 -0.006696750 0.44554281 0.29491908 0.201160808
## fw -0.00669675 1.000000000 -0.01859891 -0.00201130 0.002041653
## cl 0.44554281 -0.018598913 1.00000000 0.01607019 0.076810683
## hw 0.29491908 -0.002011300 0.01607019 1.00000000 0.028801311
## nc 0.20116081 0.002041653 0.07681068 0.02880131 1.000000000
## ag 0.65612819 -0.011617617 0.60919322 0.01339333 0.190846319
## yn 0.48400655 -0.002996787 0.16041452 0.08953095 0.256036764
## ag yn
## cg 0.65612819 0.484006546
## fw -0.01161762 -0.002996787
## cl 0.60919322 0.160414523
## hw 0.01339333 0.089530953
## nc 0.19084632 0.256036764
## ag 1.00000000 0.343500911
## yn 0.34350091 1.000000000
Some of these features are also fairly well correlated with each other, especially cg and ag.
Let’s try our luck and use all of the numeric and logical features.
lrFit <- glm(
yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz,
data = trainData, family = binomial)
summary(lrFit)
##
## Call:
## glm(formula = yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz,
## family = binomial, data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1973 -0.5002 -0.2972 -0.1190 3.3611
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.301506 0.652877 -8.120 4.65e-16 ***
## cg 7.779193 0.161731 48.099 < 2e-16 ***
## fw -0.001509 0.011694 -0.129 0.89730
## cl -4.522904 0.531707 -8.506 < 2e-16 ***
## hw -0.023662 0.011509 -2.056 0.03979 *
## nc 1.217986 0.045207 26.943 < 2e-16 ***
## ag 1.729410 0.538251 3.213 0.00131 **
## e2TRUE -3.133920 0.058001 -54.032 < 2e-16 ***
## sevrm -2.548837 0.055781 -45.693 < 2e-16 ***
## clzTRUE 0.283731 0.070998 3.996 6.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32872 on 29421 degrees of freedom
## Residual deviance: 18976 on 29412 degrees of freedom
## AIC: 18996
##
## Number of Fisher Scoring iterations: 6
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLR <- tibble()
for (fold in 1:k) {
trainDataFit <- trainData[kfInd != fold, ]
trainDataVal <- trainData[kfInd == fold, ]
lrFit <- glm(
yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz,
data = trainDataFit, family = binomial)
lrPredict <- predict(lrFit, newdata = trainDataVal, type = "response")
lrCM <- table(trainDataVal$yn, round(lrPredict))
xvalResultsLR <- rbind(xvalResultsLR, tibble(
fold = fold,
acc = calcAcc(lrCM),
sens = calcSensitivity(lrCM),
spec = calcSpecificity(lrCM),
model = "lr"
))
}
xvalResultsLR %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
This logistic regression performs surprisingly well, scoring around a 90% validation accuracy with similar levels of sensitivity and specificity.
Assess the impact/significance of pairwise interaction terms for all pairwise combinations of covariates used in the model and report the top ten that most significantly improve model fit.
In order to assess the value of interaction terms in general, let’s look at what a logistic regression fit looks like using all features and all interaction terms.
numSubsetData <- bind_cols(
select(trainData, yn),
select_if(trainData, is.numeric),
) %>% select(-id)
lmFit <- glm(yn ~ . + .*., data = numSubsetData, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lmFit)
##
## Call:
## glm(formula = yn ~ . + . * ., family = binomial, data = numSubsetData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.7107 -0.6219 -0.4481 -0.2226 2.9260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.087e+01 6.371e+00 4.846 1.26e-06 ***
## cg -1.075e+01 4.355e+00 -2.469 0.013562 *
## fw 9.310e-01 3.634e-01 2.562 0.010408 *
## cl -2.188e+01 5.317e+00 -4.115 3.87e-05 ***
## hw -4.255e-01 3.672e-01 -1.159 0.246556
## nc 7.372e+00 1.352e+00 5.453 4.97e-08 ***
## ag -5.422e+01 1.186e+01 -4.571 4.85e-06 ***
## cg:fw -1.718e-01 8.333e-02 -2.062 0.039227 *
## cg:cl 5.930e+00 3.402e+00 1.743 0.081291 .
## cg:hw 2.838e-01 1.392e-01 2.039 0.041475 *
## cg:nc 1.112e+00 3.319e-01 3.349 0.000810 ***
## cg:ag 2.125e+01 2.712e+00 7.835 4.69e-15 ***
## fw:cl -6.171e-01 2.950e-01 -2.092 0.036480 *
## fw:hw 3.397e-03 6.369e-03 0.533 0.593804
## fw:nc 5.542e-04 2.344e-02 0.024 0.981139
## fw:ag 5.854e-01 2.879e-01 2.034 0.041998 *
## cl:hw 2.069e-01 2.842e-01 0.728 0.466671
## cl:nc -4.684e+00 1.090e+00 -4.296 1.74e-05 ***
## cl:ag 1.530e+01 8.473e+00 1.806 0.070872 .
## hw:nc -6.918e-02 2.344e-02 -2.952 0.003159 **
## hw:ag -7.496e-01 2.102e-01 -3.567 0.000362 ***
## nc:ag -4.662e+00 1.068e+00 -4.367 1.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32872 on 29421 degrees of freedom
## Residual deviance: 24330 on 29400 degrees of freedom
## AIC: 24374
##
## Number of Fisher Scoring iterations: 6
Several have high predictive power. Using feature selection methods, which ones should we pick?
# taken from homework 5, which was in turn taken from ISLR lab 6.5.3
predict.regsubsets <- function (object, newdata, id, ...){
form=as.formula(object$call [[2]])
mat=model.matrix(form,newdata)
coefi=coef(object,id=id)
xvars=names (coefi)
mat[,xvars] %*% coefi
}
selectMethods <- c("exhaustive", "backward", "forward", "seqrep")
fitMetrics <- c("rsq","rss","adjr2","bic")
nVars <- ncol(numSubsetData) - 1 + ncol(combn(colnames(select(numSubsetData, -yn)), 2))
selectWhichInteract <- NULL
selectMetricsInteract <- NULL
for (method in selectMethods) {
# fit using each method of determining subsets and save summary of results
selectFit <- regsubsets(yn ~ . + .*., numSubsetData, method=method, nvmax=nVars)
selectSummary <- summary(selectFit)
# extract which features are used for each number of variables and save with method
selectWhich <- as.tibble(selectSummary$which)
selectWhich$method <- method
selectWhich$vars <- 1:nVars
selectWhichInteract <- rbind(selectWhichInteract, selectWhich)
# extract fit quality metrics for each number of variables and save with method
selectMetrics <- as.tibble(selectSummary[fitMetrics])
selectMetrics$method <- method
selectMetrics$vars <- 1:nVars
selectMetricsInteract <- rbind(selectMetricsInteract, selectMetrics)
}
selectMetricsInteract <- selectMetricsInteract %>%
gather(key = metric, value = value, fitMetrics)
ggplot(selectMetricsInteract, aes(x = vars,y = value,shape = method,colour = method)) +
geom_path() +
geom_point() +
facet_wrap(~metric,scales="free") +
theme(legend.position="top")
plotWhich <- selectWhichInteract %>%
gather(key = param, value = selected, -c(method, vars)) %>%
ggplot(aes(x = vars, y = param, fill = selected)) +
geom_tile() +
facet_wrap(~method)
selectWhichInteract %>%
gather(key = param, value = selected, -c(method, vars)) %>%
group_by(param) %>%
summarize(sum = sum(selected)) %>%
arrange(desc(sum)) %>%
head(15) %>%
ggplot(aes(x = fct_reorder(param, sum, .desc = TRUE), y = sum)) + geom_col()
lrFit <- glm(
yn ~ cg + nc + cl*nc + hw + cl + cg*ag + cl*ag + ag + cg*hw + e2 + clz,
data = trainData, family = binomial)
summary(lrFit)
##
## Call:
## glm(formula = yn ~ cg + nc + cl * nc + hw + cl + cg * ag + cl *
## ag + ag + cg * hw + e2 + clz, family = binomial, data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.2675 -0.5414 -0.3568 -0.1694 3.2445
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 20.57883 3.74696 5.492 3.97e-08 ***
## cg 1.37566 0.95454 1.441 0.150
## nc 9.75867 1.34784 7.240 4.48e-13 ***
## cl -17.08937 3.05915 -5.586 2.32e-08 ***
## hw 0.12366 0.15560 0.795 0.427
## ag -68.79280 9.11671 -7.546 4.50e-14 ***
## e2TRUE -2.89606 0.05376 -53.875 < 2e-16 ***
## clzTRUE 0.31418 0.06685 4.700 2.60e-06 ***
## nc:cl -6.25878 0.98017 -6.385 1.71e-10 ***
## cg:ag 14.90942 2.41109 6.184 6.26e-10 ***
## cl:ag 32.67074 7.58600 4.307 1.66e-05 ***
## cg:hw -0.10209 0.10367 -0.985 0.325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32872 on 29421 degrees of freedom
## Residual deviance: 21025 on 29410 degrees of freedom
## AIC: 21049
##
## Number of Fisher Scoring iterations: 6
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLRInt <- tibble()
for (fold in 1:k) {
trainDataFit <- trainData[kfInd != fold, ]
trainDataVal <- trainData[kfInd == fold, ]
lrFit <- glm(
yn ~ cg + nc + cl*nc + hw + cl + cg*ag + cl*ag + ag + e2 + clz,
data = trainDataFit, family = binomial)
lrPredict <- predict(lrFit, newdata = trainDataVal, type = "response")
lrCM <- table(trainDataVal$yn, round(lrPredict))
xvalResultsLRInt <- rbind(xvalResultsLRInt, tibble(
fold = fold,
acc = calcAcc(lrCM),
sens = calcSensitivity(lrCM),
spec = calcSpecificity(lrCM),
model = "lr-int"
))
}
xvalResultsLRInt %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
moreData <- trainDataDummies %>%
mutate(
clnc = cl * nc,
cgag = cg * ag,
clag = cl * ag,
cghw = cg * hw
)
Fit linear discriminant analysis model of the outcome
ynas a function of the rest of covariates in the dataset. Feel free to decide whether you want to use all of them or a subset of those. Test resulting model performance on multiple splits of the data into training and test subsets, summarize it in terms of accuracy/error/sensitivity/specificity and compare them to those obtained for logistic regression.
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLDA <- tibble()
for (fold in 1:k) {
trainDataFit <- trainData[kfInd != fold, ]
trainDataVal <- trainData[kfInd == fold, ]
ldaFit <- lda(
yn ~ cg + fw + cl + hw + nc + ag + e2 + se,
data = trainDataFit)
ldaPredict <- predict(ldaFit, newdata = trainDataVal)
ldaCM <- table(trainDataVal$yn, as.numeric(ldaPredict$x > 0))
xvalResultsLDA <- rbind(xvalResultsLDA, tibble(
fold = fold,
acc = calcAcc(ldaCM),
sens = calcSensitivity(ldaCM),
spec = calcSpecificity(ldaCM),
model = "lda"
))
}
xvalResultsLDA %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
LDA performs poorly in comparison to other classification methods on this dataset. The sensitivity especially is very low.
In our experience attempting to fit quadratic discriminant analysis model of the categorical outcome
ynon this data results in a rank deficiency related error. Determine on how to correct this error and report resulting model training and test error/accuracy/etc. and how it compares to LDA and logistic regression above.
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsQDA <- tibble()
for (fold in 1:k) {
trainDataFit <- trainData[kfInd != fold, ]
trainDataVal <- trainData[kfInd == fold, ]
qdaFit <- qda(
yn ~ cg + fw + cl + hw + nc + ag + e2 + se,
data = trainDataFit)
qdaPredict <- predict(qdaFit, newdata = trainDataVal)
qdaCM <- table(trainDataVal$yn, qdaPredict$class)
xvalResultsQDA <- rbind(xvalResultsQDA, tibble(
fold = fold,
acc = calcAcc(qdaCM),
sens = calcSensitivity(qdaCM),
spec = calcSpecificity(qdaCM),
model = "qda"
))
}
xvalResultsQDA %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
QDA performs better than LDA, but not by much.
Develop random forest model of outcome
yn. Present variable importance plots and comment on relative importance of different attributes in the model. Did attributes showing up as more important in random forest model also appear as significantly associated with the outcome by logistic regression? Test model performance on multiple splits of data into training and test subsets, compare test and out-of-bag error estimates, summarize model performance in terms of accuracy/error/sensitivity/specificity and compare to the performance of logistic regression and LDA models above.
Let’s try fitting one on a sampled dataset first to see what the important features are.
smallSample <- trainData[sample(1:nrow(trainData), nrow(trainData) / 10, replace = TRUE), ]
rfFit <- randomForest(select(smallSample, -yn), smallSample$yn)
rfFit$importance %>% as.tibble(rownames = "feature") %>%
arrange(desc(MeanDecreaseGini)) %>% head(15) %>%
ggplot(aes(x = fct_reorder(feature, MeanDecreaseGini, .desc = TRUE), y = MeanDecreaseGini)) +
geom_col()
By far the most important feature was cg. After that, e2, eg, ag, oc, cl, se, … etc, are also selected as important features.
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsRF <- tibble()
for (fold in 1:k) {
trainDataFit <- trainData[kfInd != fold, ]
trainDataVal <- trainData[kfInd == fold, ]
rfFit <- randomForest(select(trainDataFit, -yn), trainDataFit$yn)
rfPredict <- predict(rfFit, newdata = select(trainDataVal, -yn))
rfCM <- table(trainDataVal$yn, rfPredict)
xvalResultsRF <- rbind(xvalResultsRF, tibble(
fold = fold,
acc = calcAcc(rfCM),
sens = calcSensitivity(rfCM),
spec = calcSpecificity(rfCM),
model = "rf"
))
}
xvalResultsRF %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
The random forest performs very well. Higher than 92% accuracy with slightly better sensitivity than specificity.
Develop SVM model of categorical outcome
yndeciding on the choice of kernel, cost, etc. that appear to yield better performance. Test model performance on multiple splits of data into training and test subsets, summarize model performance in terms of accuracy/error/sensitivity/specificity and compare to the performance of the rest of the models developed above (logistic regression, LDA, random forest).
I’ll just use a small subset of the data to tune the svm parameters.
smallSample <- trainData[sample(1:nrow(trainDataDummies), nrow(trainDataDummies) / 10, replace = TRUE), ]
smallSample <- smallSample %>% select_if(negate(is.factor)) %>%
bind_cols(smallSample %>% select(yn)) %>% select(-id)
# smallSample <- smallSample[, colSums(smallSample %>% select_if(is.numeric)) != 0]
costs <- c(0.3, 0.5, 0.8, 1)
tune.out <- tune(svm, yn ~ ., data = smallSample,
kernel = "linear", scale = TRUE,
ranges = list(cost = costs))
After running this a few times, a cost of 0.008 seems to be the consensus pick for the best.
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.8
##
## - best performance: 0.1339168
##
## - Detailed performance results:
## cost error dispersion
## 1 0.3 0.1345959 0.01557704
## 2 0.5 0.1342569 0.01542513
## 3 0.8 0.1339168 0.01623091
## 4 1.0 0.1339168 0.01623091
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsSVM <- tibble()
for (fold in 1:k) {
trainDataFit <- trainData[kfInd != fold, ] %>%
select_if(negate(is.factor)) %>%
bind_cols(trainData[kfInd != fold, ] %>% select(yn)) %>% select(-id)
trainDataVal <- trainData[kfInd == fold, ] %>% mutate(yn = as.numeric(yn) - 1) %>%
select_if(negate(is.factor)) %>%
bind_cols(trainData[kfInd == fold, ] %>% select(yn)) %>% select(-id)
svmFit <- svm(yn ~ .,
data = trainDataFit, kernel = "linear", scale = TRUE,
cost = 0.8)
svmPredict <- predict(svmFit, trainDataVal)
svmCM <- table(trainDataVal$yn, svmPredict)
xvalResultsSVM <- rbind(xvalResultsSVM, tibble(
fold = fold,
acc = calcAcc(svmCM),
sens = calcSensitivity(svmCM),
spec = calcSpecificity(svmCM),
model = "lin-svm"
))
}
xvalResultsSVM %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
The SVM appears to perform well, but still not as well as the random forest.
Experiment with fitting neural network models of categorical outcome
ynfor this data and evaluate their performance on different splits of the data into training and test. Compare model performance to that for the rest of classifiers developed above.
I think the neuralnet package is a bit of a joke so I used the keras interface for R instead. First, because the classes are imbalanced, I’ll sample from the “yes” rows to ensure there’s an equal number of samples with each target response.
yesRows <- moreData %>% filter(yn == "yes")
resampData <- yesRows %>%
.[sample(1:nrow(yesRows), nrow(moreData) - 2 * nrow(yesRows), replace=TRUE), ] %>%
bind_rows(moreData) %>%
mutate(ynn = as.numeric(yn) - 1) %>%
select_if(negate(is.factor)) %>%
select(-id)
resampData %>% count(ynn)
## # A tibble: 2 x 2
## ynn n
## <dbl> <int>
## 1 0 22165
## 2 1 22165
After playing around for a bit, I settled on this model structure. Although to be fair, changing parameters only affects the accuracy really at the margins.
model <- keras_model_sequential()
model %>%
layer_dense(units = 128, activation = "relu", input_shape = ncol(resampData) - 1, name = "dense_1") %>%
layer_dropout(rate = 0.3, name = "dropout_1") %>%
layer_dense(units = 32, activation = "relu", name = "dense_2") %>%
layer_dropout(rate = 0.3, name = "dropout_2") %>%
layer_dense(units = 1, activation = "sigmoid", name = "dense_final")
summary(model)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 128) 8320
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 128) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 32) 4128
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 32) 0
## ___________________________________________________________________________
## dense_final (Dense) (None, 1) 33
## ===========================================================================
## Total params: 12,481
## Trainable params: 12,481
## Non-trainable params: 0
## ___________________________________________________________________________
It’s a binary classification problem, so I’m using a sigmoid activation function on the final neuron, binary crossentropy as my loss function and the adam optimizer.
model %>% compile(
loss = loss_binary_crossentropy,
optimizer = optimizer_adam(),
metrics = c('accuracy')
)
history <- model %>% fit(
x = as.matrix(resampData %>% select(-ynn)),
y = as.matrix(resampData %>% select(ynn)),
epochs = 50, batch_size = 512, validation_split = 0.2
)
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(resampData)))
xvalResultsNN <- tibble()
for (fold in 1:k) {
trainDataFit <- resampData[kfInd != fold, ]
trainDataVal <- resampData[kfInd == fold, ]
model <- keras_model_sequential()
model %>%
layer_dense(units = 128, activation = "relu", input_shape = ncol(trainDataFit) - 1, name = "dense_1") %>%
layer_dropout(rate = 0.3, name = "dropout_1") %>%
layer_dense(units = 32, activation = "relu", name = "dense_2") %>%
layer_dropout(rate = 0.3, name = "dropout_2") %>%
layer_dense(units = 1, activation = "sigmoid", name = "dense_final")
model %>% compile(
loss = loss_binary_crossentropy,
optimizer = optimizer_adam(),
metrics = c('accuracy')
)
history <- model %>% fit(
x = as.matrix(trainDataFit %>% select(-ynn)),
y = as.matrix(trainDataFit %>% select(ynn)),
epochs = 50, batch_size = 512, verbose = 0
)
nnPredict <- model %>% predict_classes(as.matrix(trainDataVal %>% select(-ynn)))
nnCM <- table(trainDataVal$ynn, nnPredict)
xvalResultsNN <- rbind(xvalResultsNN, tibble(
fold = fold,
acc = calcAcc(nnCM),
sens = calcSensitivity(nnCM),
spec = calcSpecificity(nnCM),
model = "nn"
))
}
xvalResultsNN %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
The results are pretty good. Fairly high sensitivity, which some regularization could help with.
xvalResults <- bind_rows(
xvalResultsLR,
xvalResultsLRInt,
xvalResultsLDA,
xvalResultsQDA,
xvalResultsRF,
xvalResultsSVM,
xvalResultsNN
)
xvalResults %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = fct_reorder(model, value, .desc = TRUE), y = value, col = metric)) +
geom_boxplot()
In comparison with other methods, it falls a little bit short. Logistic regression, linear basis SVM, and random forest all have higher cross-validated test accuracy.
Compare performance of the models developed above (logistic regression, LDA, random forest, SVM) in terms of their accuracy, error and sensitivity/specificity. Comment on differences and similarities between them.
xvalResults <- bind_rows(
xvalResultsLR,
xvalResultsLRInt,
xvalResultsLDA,
xvalResultsQDA,
xvalResultsRF,
xvalResultsSVM,
xvalResultsNN
)
xvalResults %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = fct_reorder(model, value, .desc = TRUE), y = value, col = metric)) +
geom_boxplot()
In all, the random forest (without additional added parameters) is the best model, and what I’ll choose to make my test set predictions on.
Decide on the model that performs the best and use it to make predictions for the test dataset. This is the dataset that is provided separately from training data without the outcome
ynthat we are modeling here. Upload resulting predictions in comma-separated values (CSV) format into the Canvas website. Please check sample files with test dataset predictions for the expected format of the *.csv file with predictions.
Before trying a prediction, let’s take one last look at
rfFit <- randomForest(select(moreData, -yn), moreData$yn)
rfFit$importance %>% as.tibble(rownames = "feature") %>%
arrange(desc(MeanDecreaseGini)) %>%
head(20) %>%
ggplot(aes(x = fct_reorder(feature, MeanDecreaseGini, .desc = TRUE), y = MeanDecreaseGini)) +
geom_col()
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(moreData)))
xvalResultsRFM <- tibble()
for (fold in 1:k) {
trainDataFit <- moreData[kfInd != fold, ]
trainDataVal <- moreData[kfInd == fold, ]
rfmFit <- randomForest(select(trainDataFit, -yn), trainDataFit$yn)
rfmPredict <- predict(rfmFit, newdata = select(trainDataVal, -yn))
rfmCM <- table(trainDataVal$yn, rfmPredict)
xvalResultsRFM <- rbind(xvalResultsRFM, tibble(
fold = fold,
acc = calcAcc(rfmCM),
sens = calcSensitivity(rfmCM),
spec = calcSpecificity(rfmCM),
model = "rf-m"
))
}
xvalResultsRFM %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
It appears that the random forest without extra data still performs the best. With that settled, let’s make some predictions.
xvalResults <- bind_rows(
xvalResultsLR,
xvalResultsLRInt,
xvalResultsRF,
xvalResultsRFM,
xvalResultsSVM,
xvalResultsNN
)
xvalResults %>% gather(key = metric, value = value, -c(fold, model)) %>%
ggplot(aes(x = fct_reorder(model, value, .desc = TRUE), y = value, col = metric)) +
geom_boxplot()
rawTestData <- read_csv("final-data-test.csv") %>% mutate_if(is.character, as.factor)
## Parsed with column specification:
## cols(
## id = col_integer(),
## cg = col_double(),
## eg = col_character(),
## re = col_character(),
## ms = col_character(),
## fw = col_double(),
## cl = col_double(),
## hw = col_double(),
## ra = col_character(),
## nc = col_double(),
## se = col_character(),
## ag = col_double(),
## wc = col_character(),
## ne = col_character(),
## oc = col_character(),
## dr = col_character()
## )
newcl <- predict(clLRFit, newdata = filter(rawTestData, cl == 0))
testData <- rawTestData %>%
mutate(
e2 = eg == "e2",
clz = cl == 0,
hw = log(hw + 1e-6)
)
testData[testData$cl == 0, "cl"] <- newcl
testDataDummies <- as.tibble(dummy.data.frame(as.data.frame(testData, -yn)))
rfFit <- randomForest(select(trainData, -yn), trainData$yn)
rfPredict <- predict(rfFit, newdata = testData)
submission <- testData %>% select(id) %>% mutate(GenghisKhan = rfPredict)
submission %>% write_csv("cs63_final_results.csv")
This is not really a problem per se but rather a criterion we will go by assessing quality of your predictions for the test dataset. You get these four points if your predictions for test dataset are better than those obtained from a fair coin flip (already shown in leaderboard and as examples of the file format for predictions upload) by at least 10% on all four metrics shown in the leaderboard (accuracy, sensitivity, specificity and precision). But then predictions by the coin flip should not be very difficult to improve upon.
Because during previous offerings of this course there were always several posts on piazza regarding how long it takes to fit various classifiers to the final exam dataset we have added this note here.
First of all, we most definitely do not expect you to have to buy capacity from AWS to complete this assignment. You certainly can if you want to, but this course is not about that and this dataset is really not that big to require it. Something reasonable/useful can be accomplished for this data with middle of the road hardware. For instance, knitting of the entire official solution for the final exam on 8Gb RAM machine with two i5-7200u cores takes about an hour using single-threaded R/Rstudio and this includes both extra points problems as well as various assessments of the performance of different models as function of data size and so on.
Second, your solution should not take hours and hours to compile. If it does, it could be that it is attempting to do too much, or something is implemented inefficiently, or just plain incorrectly - it is impossible for us to comment on this until we see the code when we grade it. In general, it is often very prudent to “start small” – fit your model on a random subset of data small enough for the model fitting call to return immediately, check how model performance (both in terms of error and time it takes to compute) scales with the size of the data you are training it on (as you increase it in size, say, two-fold several times), for tuning start with very coarse grid of parameter values and given those results decide what it right for you, etc.
Lastly, making the decision about what is right for the problem at hand, how much is enough, etc. is inherent in this line of work. If you choose to conduct model tuning on a subset of the data - especially if you have some assessment of how the choice of tuning parameter and test error is affected by the size of training dataset - it could be a very wise choice. If it is more efficient for you to knit each problem separately, by all means feel free to do that - just remember to submit each .Rmd and HTML file that comprises your entire solution. On that note, if you end up using any of the unorthodox setups for your calculations (e.g. AWS, parallel processing, multiple machines, etc. - none of which are essential for solving it correctly) please be sure that when we grade we have every relevant piece of code available - we won’t be able to grade your work if we are not clear about how the results were obtained.
In the end, the final exam asks you to assess performance of several classification technologies on a new dataset, decide on which classifier is the best and use it to make predictions for the test data. It is very much up to you how exactly you want to go about it. There could be many versions of correct and informative solution for that (as there could be just as many if not more that are completely wrong).
As always, best of luck - we are practically done here!